Getting Started
- 作业介绍
- recitation-测试
- homework-改进代码
- 提前设置
make <target> DEBUG=1
Recitation: Perf and Cachegrind
Perf
安装 Perf
出现了问题,已解决,但是没有记录
Perf 个人使用速查
perf record
用法:
$ perf record <program_name> <program_arguments>
running ./isort n k will sort an array of n elements k times.
$ sudo perf record ./isort 10000 10
Sorting 10000 values...
Done!
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.035 MB perf.data (887 samples) ]
perf report
用法:直接perf report
$ sudo perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 887 of event 'cpu-clock:pppH'
# Event count (approx.): 221750000
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .................
#
99.44% isort isort [.] isort
0.23% isort libc-2.27.so [.] rand_r
0.11% isort [kernel.kallsyms] [k] queue_work_on
0.11% isort [kernel.kallsyms] [k] release_pages
0.11% isort isort [.] main
这里就可以看到 isort 占据了很大一部分时间
Cachegrind
Cachegrind (a Valgrind tool) is a cache and branch-prediction profiler On virtual environments like those on AWS, hardware events providing information about branches and cache misses are often unavailable, so perf may not be helpful.
valgrind
使用:
$ valgrind --tool=cachegrind --branch-sim=yes <program_name> <program_arguments>
输出解释:
D1 represents the lowest-level cache (L1)
LL represents the last (highest) level data cache (on most machines, L3)
例子代码
// Copyright (c) 2012 MIT License by 6.172 Staff
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
typedef uint32_t data_t;
const int U = 10000000; // size of the array. 10 million vals ~= 40MB
const int N = 100000000; // number of searches to perform
int main() {
data_t* data = (data_t*) malloc(U * sizeof(data_t));
if (data == NULL) {
free(data);
printf("Error: not enough memory\n");
exit(-1);
}
// fill up the array with sequential (sorted) values.
int i;
for (i = 0; i < U; i++) {
data[i] = i;
}
printf("Allocated array of size %d\n", U);
printf("Summing %d random values...\n", N);
data_t val = 0;
data_t seed = 42;
for (i = 0; i < N; i++) {
int l = rand_r(&seed) % U;
val = (val + data[l]);
}
free(data);
printf("Done. Value = %d\n", val);
return 0;
}
编译
$ make sum
剖析
$ valgrind --tool=cachegrind --branch-sim=yes ./sum
输出
==125== Cachegrind, a cache and branch-prediction profiler
==125== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==125== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==125== Command: ./sum
==125==
--125-- warning: L3 cache found, using its data for the LL simulation.
Allocated array of size 10000000
Summing 100000000 random values...
Done. Value = 938895920
==125==
==125== I refs: 3,440,227,630
==125== I1 misses: 1,195
==125== LLi misses: 1,180
==125== I1 miss rate: 0.00%
==125== LLi miss rate: 0.00%
==125==
==125== D refs: 610,074,405 (400,058,343 rd + 210,016,062 wr)
==125== D1 misses: 100,507,416 ( 99,881,550 rd + 625,866 wr)
==125== LLd misses: 37,859,997 ( 37,234,195 rd + 625,802 wr)
==125== D1 miss rate: 16.5% ( 25.0% + 0.3% )
==125== LLd miss rate: 6.2% ( 9.3% + 0.3% )
==125==
==125== LL refs: 100,508,611 ( 99,882,745 rd + 625,866 wr)
==125== LL misses: 37,861,177 ( 37,235,375 rd + 625,802 wr)
==125== LL miss rate: 0.9% ( 1.0% + 0.3% )
==125==
==125== Branches: 210,043,840 (110,043,336 cond + 100,000,504 ind)
==125== Mispredicts: 5,456 ( 5,248 cond + 208 ind)
==125== Mispred rate: 0.0% ( 0.0% + 0.0% )
查找 Cache 信息
lscpu
用法:
$ lscpu
作用:
find information about your CPU and its caches
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 154
Model name: 12th Gen Intel(R) Core(TM) i7-12700H
Stepping: 3
CPU MHz: 2687.998
BogoMIPS: 5375.99
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 24576K
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities
在我们这里就可以看到 L1、L2 和 L3 的 Cache 信息了
再观察输出结果,因为我们写 Cache miss 少(代码顺序写)和读 Cache miss 多(代码随机读) 然后我们通过改变 U 和 N,控制 U 小于 L1、L2、L3 Cache 等情况,就能看出 N、U 大小和 D、L、LLD Cache miss 的关系
Homework: Sorting
Write-up 1:
Compare the Cachegrind output on the DEBUG=1 code versus DEBUG=0 compiler optimized code. Explain the advantages and disadvantages of using instruction count as a substitute for time when you compare the performance of different versions of this program
$ make
clang main.c tests.c util.c isort.c sort_a.c sort_c.c sort_i.c sort_p.c sort_m.c sort_f.c -O3 -DNDEBUG -g -Wall -std=gnu99 -gdwarf-3 -always-inline -lrt -lm -o sort
clang: warning: argument unused during compilation: '-always-inline' [-Wunused-command-line-argument]
$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==184== Cachegrind, a cache and branch-prediction profiler
==184== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==184== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==184== Command: ./sort 10000 10
==184==
--184-- warning: L3 cache found, using its data for the LL simulation.
Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.014364 sec
sort_a repeated : Elapsed execution time: 0.013956 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.027513 sec
sort_a repeated : Elapsed execution time: 0.027134 sec
Running test #1...
--> test_zero_element at line 245: PASS
Running test #2...
--> test_one_element at line 266: PASS
Done testing.
==184==
==184== I refs: 235,925,874
==184== I1 misses: 1,586
==184== LLi misses: 1,467
==184== I1 miss rate: 0.00%
==184== LLi miss rate: 0.00%
==184==
==184== D refs: 87,546,459 (52,667,725 rd + 34,878,734 wr)
==184== D1 misses: 228,442 ( 127,084 rd + 101,358 wr)
==184== LLd misses: 5,123 ( 2,420 rd + 2,703 wr)
==184== D1 miss rate: 0.3% ( 0.2% + 0.3% )
==184== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==184==
==184== LL refs: 230,028 ( 128,670 rd + 101,358 wr)
==184== LL misses: 6,590 ( 3,887 rd + 2,703 wr)
==184== LL miss rate: 0.0% ( 0.0% + 0.0% )
==184==
==184== Branches: 40,474,926 (38,773,975 cond + 1,700,951 ind)
==184== Mispredicts: 2,484,878 ( 2,484,522 cond + 356 ind)
==184== Mispred rate: 6.1% ( 6.4% + 0.0% )
$ make clean
rm -f ./sort *.std* *.gcov *.gcda *.gcno default.profraw
$ make DEBUG=1
clang main.c tests.c util.c isort.c sort_a.c sort_c.c sort_i.c sort_p.c sort_m.c sort_f.c -DDEBUG -O0 -g -Wall -std=gnu99 -gdwarf-3 -always-inline -lrt -lm -o sort
clang: warning: argument unused during compilation: '-always-inline' [-Wunused-command-line-argument]
$ valgrind --tool=cachegrind --
ranch-sim=yes ./sort 10000 10valgrind: no program specified
valgrind: Use --help for more information.
cjl@ChenJulian:~/solution/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==206== Cachegrind, a cache and branch-prediction profiler
==206== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==206== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==206== Command: ./sort 10000 10
==206==
--206-- warning: L3 cache found, using its data for the LL simulation.
Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.025243 sec
sort_a repeated : Elapsed execution time: 0.025381 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.049681 sec
sort_a repeated : Elapsed execution time: 0.049848 sec
Running test #1...
--> test_zero_element at line 245: PASS
Running test #2...
--> test_one_element at line 266: PASS
Done testing.
==206==
==206== I refs: 408,412,706
==206== I1 misses: 1,553
==206== LLi misses: 1,457
==206== I1 miss rate: 0.00%
==206== LLi miss rate: 0.00%
==206==
==206== D refs: 260,697,695 (194,384,270 rd + 66,313,425 wr)
==206== D1 misses: 228,025 ( 126,767 rd + 101,258 wr)
==206== LLd misses: 5,115 ( 2,411 rd + 2,704 wr)
==206== D1 miss rate: 0.1% ( 0.1% + 0.2% )
==206== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==206==
==206== LL refs: 229,578 ( 128,320 rd + 101,258 wr)
==206== LL misses: 6,572 ( 3,868 rd + 2,704 wr)
==206== LL miss rate: 0.0% ( 0.0% + 0.0% )
==206==
==206== Branches: 45,174,708 ( 43,473,813 cond + 1,700,895 ind)
==206== Mispredicts: 3,109,490 ( 3,109,160 cond + 330 ind)
==206== Mispred rate: 6.9% ( 7.2% + 0.0% )
得出结果
DEBUG | 0 | 1 |
---|---|---|
时间 | 快 | 慢 |
指令数 | 少 | 多 |
Cache Misses 率 | 大 | 小 |
MISpredicts 率 | 小 | 大 |
O0 牺牲了 cache 命中率,但是指令数少,时间快
inlining
You would like to see how much inline functions can help. Copy over the code from sort_a.c into sort_i.c, and change all the routine names from
_a to _i. Using the inline keyword, inline one or more of the functions in sort_i.c and util.c. To add the sort_i routine to the testing suite, uncomment the line in main.c, under testFunc, that specifies sort_i. Profile and annotate the inlined program.
Write-up 2:
Explain which functions you chose to inline and report the performance differences you observed between the inlined and uninlined sorting routines.
如题,将 sort_i 中的 merge_i 和 copy_i 设置为 inline。并进行测试
编译和 perf record 的过程不再给出
- no use inlining & DEBUG=1
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf record ./sort 10000 10
Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.000694 sec
sort_a repeated : Elapsed execution time: 0.000679 sec
sort_i : Elapsed execution time: 0.000679 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.000950 sec
sort_a repeated : Elapsed execution time: 0.000989 sec
sort_i : Elapsed execution time: 0.000942 sec
Running test #1...
--> test_zero_element at line 245: PASS
Running test #2...
--> test_one_element at line 266: PASS
Done testing.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.006 MB perf.data (116 samples) ]
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 116 of event 'cpu-clock:uhpppH'
# Event count (approx.): 29000000
#
# Overhead Command Shared Object Symbol
# ........ ....... ............. ...........................
#
43.10% sort sort [.] sort_a
17.24% sort libc-2.27.so [.] cfree@GLIBC_2.2.5
16.38% sort libc-2.27.so [.] malloc
15.52% sort sort [.] sort_i
2.59% sort sort [.] mem_alloc
1.72% sort sort [.] mem_free
0.86% sort libc-2.27.so [.] intel_check_word.isra.0
0.86% sort libc-2.27.so [.] rand_r
0.86% sort sort [.] free@plt
0.86% sort sort [.] malloc@plt
#
# (Tip: Show current config key-value pairs: perf config --list)
#
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==698== Cachegrind, a cache and branch-prediction profiler
==698== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==698== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==698== Command: ./sort 10000 10
==698==
--698-- warning: L3 cache found, using its data for the LL simulation.
Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.014621 sec
sort_a repeated : Elapsed execution time: 0.014605 sec
sort_i : Elapsed execution time: 0.014851 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.028227 sec
sort_a repeated : Elapsed execution time: 0.028304 sec
sort_i : Elapsed execution time: 0.028753 sec
Running test #1...
--> test_zero_element at line 245: PASS
Running test #2...
--> test_one_element at line 266: PASS
Done testing.
==698==
==698== I refs: 351,917,014
==698== I1 misses: 1,624
==698== LLi misses: 1,494
==698== I1 miss rate: 0.00%
==698== LLi miss rate: 0.00%
==698==
==698== D refs: 130,160,086 (78,415,144 rd + 51,744,942 wr)
==698== D1 misses: 324,704 ( 184,171 rd + 140,533 wr)
==698== LLd misses: 5,123 ( 2,421 rd + 2,702 wr)
==698== D1 miss rate: 0.2% ( 0.2% + 0.3% )
==698== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==698==
==698== LL refs: 326,328 ( 185,795 rd + 140,533 wr)
==698== LL misses: 6,617 ( 3,915 rd + 2,702 wr)
==698== LL miss rate: 0.0% ( 0.0% + 0.0% )
==698==
==698== Branches: 60,182,795 (57,681,766 cond + 2,501,029 ind)
==698== Mispredicts: 3,703,747 ( 3,703,346 cond + 401 ind)
==698== Mispred rate: 6.2% ( 6.4% + 0.0% )
- use inlining & DEBUG=1
$ cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf record ./sort 10000 10
Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.001043 sec
sort_a repeated : Elapsed execution time: 0.001049 sec
sort_i : Elapsed execution time: 0.001048 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.001652 sec
sort_a repeated : Elapsed execution time: 0.001609 sec
sort_i : Elapsed execution time: 0.001595 sec
Running test #1...
--> test_zero_element at line 245: PASS
Running test #2...
--> test_one_element at line 266: PASS
Done testing.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.009 MB perf.data (203 samples) ]
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 203 of event 'cpu-clock:uhpppH'
# Event count (approx.): 50750000
#
# Overhead Command Shared Object Symbol
# ........ ....... ............. ................................
#
34.98% sort sort [.] merge_a
15.27% sort sort [.] merge_i
12.32% sort sort [.] copy_a
9.85% sort libc-2.27.so [.] cfree@GLIBC_2.2.5
8.87% sort sort [.] copy_i
5.42% sort sort [.] sort_a
3.45% sort libc-2.27.so [.] malloc
1.97% sort sort [.] copy_data
1.97% sort sort [.] mem_alloc
1.97% sort sort [.] sort_i
0.99% sort sort [.] init_data
0.99% sort sort [.] mem_free
0.49% sort ld-2.27.so [.] strcmp
0.49% sort libc-2.27.so [.] __fxstat64
0.49% sort libc-2.27.so [.] __memmove_avx_unaligned_erms
0.49% sort sort [.] malloc@plt
#
# (Tip: Customize output of perf script with: perf script -F event,ip,sym)
#
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==758== Cachegrind, a cache and branch-prediction profiler
==758== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==758== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==758== Command: ./sort 10000 10
==758==
--758-- warning: L3 cache found, using its data for the LL simulation.
Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.025676 sec
sort_a repeated : Elapsed execution time: 0.025433 sec
sort_i : Elapsed execution time: 0.025869 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_a : Elapsed execution time: 0.050344 sec
sort_a repeated : Elapsed execution time: 0.050162 sec
sort_i : Elapsed execution time: 0.050949 sec
Running test #1...
--> test_zero_element at line 245: PASS
Running test #2...
--> test_one_element at line 266: PASS
Done testing.
==758==
==758== I refs: 608,332,662
==758== I1 misses: 1,574
==758== LLi misses: 1,481
==758== I1 miss rate: 0.00%
==758== LLi miss rate: 0.00%
==758==
==758== D refs: 388,800,501 (289,890,848 rd + 98,909,653 wr)
==758== D1 misses: 323,971 ( 183,785 rd + 140,186 wr)
==758== LLd misses: 5,122 ( 2,419 rd + 2,703 wr)
==758== D1 miss rate: 0.1% ( 0.1% + 0.1% )
==758== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==758==
==758== LL refs: 325,545 ( 185,359 rd + 140,186 wr)
==758== LL misses: 6,603 ( 3,900 rd + 2,703 wr)
==758== LL miss rate: 0.0% ( 0.0% + 0.0% )
==758==
==758== Branches: 67,436,323 ( 64,935,328 cond + 2,500,995 ind)
==758== Mispredicts: 4,682,231 ( 4,681,844 cond + 387 ind)
==758== Mispred rate: 6.9% ( 7.2% + 0.0% )
我们可以看到内联函数会导致时间花费更长。
在这个写作任务中,你需要解释递归函数内联化可能导致的性能下降,并说明使用 Cachegrind 收集的分析数据如何帮助你测量这些负面性能影响。
递归函数内联化的潜在性能问题包括:
- 代码膨胀:内联递归函数会导致代码膨胀,因为每次递归调用都会展开为相应的代码,这可能会导致生成更多的指令。代码膨胀可能会增加指令缓存的压力,降低缓存命中率。
- 栈消耗:递归函数通常使用函数调用栈来保存每个递归调用的状态。内联递归函数可能会导致栈消耗过大,尤其在递归深度较大时。较大的栈消耗可能会导致栈溢出或者减慢程序的执行速度。
- 冗余计算:内联展开递归函数的过程中,可能会进行一些冗余计算,因为相同的计算可能在不同的展开代码中多次出现。这会增加指令执行的开销,降低程序的效率。
Pointers vs Arrays
Write-up 4
Give a reason why using pointers may improve performance. Report on any performance differences you observed in your implementation.
指针
Coarsening
Write-up 5
Explain what sorting algorithm you used and how you chose the number of elements to be sorted in the base case. Report on the performance differences you observed.
- 未优化
代码
static inline void merge_m(data_t *A, int p, int q, int r)
{
assert(A);
assert(p <= q);
assert((q + 1) <= r);
int n1 = q - p + 1;
int n2 = r - q;
data_t *left = 0, *right = 0;
mem_alloc(&left, n1 + 1);
mem_alloc(&right, n2 + 1);
if (left == NULL || right == NULL)
{
mem_free(&left);
mem_free(&right);
return;
}
copy_m(&(A[p]), left, n1);
copy_m(&(A[q + 1]), right, n2);
left[n1] = UINT_MAX;
right[n2] = UINT_MAX;
int i = 0;
int j = 0;
for (int k = p; k <= r; k++)
{
if (left[i] <= right[j])
{
A[k] = left[i];
i++;
}
else
{
A[k] = right[j];
j++;
}
}
mem_free(&left);
mem_free(&right);
}
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==41== Cachegrind, a cache and branch-prediction profiler
==41== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==41== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==41== Command: ./sort 10000 10
==41==
--41-- warning: L3 cache found, using its data for the LL simulation.
Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_m : Elapsed execution time: 0.031639 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
--> test_correctness at line 217: PASS
sort_m : Elapsed execution time: 0.062058 sec
Running test #1...
--> test_zero_element at line 245: PASS
Running test #2...
--> test_one_element at line 266: PASS
Done testing.
==41==
==41== I refs: 208,493,318
==41== I1 misses: 1,554
==41== LLi misses: 1,458
==41== I1 miss rate: 0.00%
==41== LLi miss rate: 0.00%
==41==
==41== D refs: 132,595,246 (98,877,918 rd + 33,717,328 wr)
==41== D1 misses: 131,781 ( 69,494 rd + 62,287 wr)
==41== LLd misses: 5,116 ( 2,413 rd + 2,703 wr)
==41== D1 miss rate: 0.1% ( 0.1% + 0.2% )
==41== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==41==
==41== LL refs: 133,335 ( 71,048 rd + 62,287 wr)
==41== LL misses: 6,574 ( 3,871 rd + 2,703 wr)
==41== LL miss rate: 0.0% ( 0.0% + 0.0% )
==41==
==41== Branches: 22,912,991 (22,012,175 cond + 900,816 ind)
==41== Mispredicts: 1,558,061 ( 1,557,736 cond + 325 ind)
==41== Mispred rate: 6.8% ( 7.1% + 0.0% )
- 优化后:
报错,感觉有点花费时间。先不细究了。